Linear, Deterministic, and Order-Invariant Initialization Methods for the K-Means Clustering Algorithm
Over the past five decades, k-means has become the clustering algorithm of
choice in many application domains primarily due to its simplicity, time/space
efficiency, and invariance to the ordering of the data points. Unfortunately,
the algorithm's sensitivity to the initial selection of the cluster centers
remains its most serious drawback. Numerous initialization methods have
been proposed to address this drawback. Many of these methods, however, have
time complexity superlinear in the number of data points, which makes them
impractical for large data sets. On the other hand, linear methods are often
random and/or sensitive to the ordering of the data points. These methods are
generally unreliable in that the quality of their results is unpredictable.
Therefore, it is common practice to perform multiple runs of such methods and
take the output of the run that produces the best results. Such a practice,
however, greatly increases the computational requirements of the otherwise
highly efficient k-means algorithm. In this chapter, we investigate the
empirical performance of six linear, deterministic (non-random), and
order-invariant k-means initialization methods on a large and diverse
collection of data sets from the UCI Machine Learning Repository. The results
demonstrate that two relatively unknown hierarchical initialization methods due
to Su and Dy outperform the remaining four methods with respect to two
objective effectiveness criteria. In addition, a recent method due to Erisoglu
et al. performs surprisingly poorly.
Comment: 21 pages, 2 figures, 5 tables, Partitional Clustering Algorithms
(Springer, 2014). arXiv admin note: substantial text overlap with
arXiv:1304.7465, arXiv:1209.196
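To make the "deterministic and order-invariant" property concrete, here is a minimal NumPy sketch of one such seeding scheme, the maximin (farthest-first) heuristic started from the point nearest the data mean. This is an illustration of the general idea only, not the Su and Dy hierarchical methods evaluated in the chapter; the function name is ours.

```python
import numpy as np

def deterministic_maximin_init(X, k):
    """Deterministic, order-invariant k-means seeding (illustrative sketch,
    not the Su & Dy hierarchical methods discussed in the chapter).

    The first center is the point closest to the overall mean; each
    subsequent center is the point farthest from its nearest chosen center.
    No randomness is involved, and the result depends only on the set of
    points, not on their ordering (up to exact distance ties).
    """
    X = np.asarray(X, dtype=float)
    # First center: the point nearest the data mean.
    mean = X.mean(axis=0)
    centers = [X[np.argmin(((X - mean) ** 2).sum(axis=1))]]
    for _ in range(1, k):
        # Squared distance from each point to its nearest existing center.
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])  # farthest point becomes a new center
    return np.array(centers)
```

Because no run depends on a random seed, a single run suffices, which is exactly the appeal of this class of methods over repeated random restarts.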
Iron Behaving Badly: Inappropriate Iron Chelation as a Major Contributor to the Aetiology of Vascular and Other Progressive Inflammatory and Degenerative Diseases
The production of peroxide and superoxide is an inevitable consequence of
aerobic metabolism, and while these particular "reactive oxygen species" (ROSs)
can exhibit a number of biological effects, they are not of themselves
excessively reactive and thus they are not especially damaging at physiological
concentrations. However, their reactions with poorly liganded iron species can
lead to the catalytic production of the very reactive and dangerous hydroxyl
radical, which is exceptionally damaging, and a major cause of chronic
inflammation. We review the considerable and wide-ranging evidence for the
involvement of this combination of (su)peroxide and poorly liganded iron in a
large number of physiological and indeed pathological processes and
inflammatory disorders, especially those involving the progressive degradation
of cellular and organismal performance. These diseases share a great many
similarities and thus might be considered to have a common cause (i.e.
iron-catalysed free radical and especially hydroxyl radical generation). The
studies reviewed include those focused on a series of cardiovascular, metabolic
and neurological diseases, where iron can be found at the sites of plaques and
lesions, as well as studies showing the significance of iron to aging and
longevity. The effective chelation of iron by natural or synthetic ligands is
thus of major physiological (and potentially therapeutic) importance. From a
systems perspective, we need to recognise that physiological observables have
multiple molecular causes, and studying them in isolation leads to inconsistent
patterns of apparent causality when it is the simultaneous combination of
multiple factors that is responsible. This explains, for instance, the
decidedly mixed effects of antioxidants that have been observed.
Comment: 159 pages, including 9 figures and 2184 references
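The iron-catalysed generation of the hydroxyl radical described above can be summarised by the classical Fenton and Haber-Weiss reactions; these are standard textbook chemistry, written here in our own notation rather than taken from the review:

```latex
% Fenton reaction: Fe(II) reduces hydrogen peroxide to the hydroxyl radical.
\mathrm{Fe^{2+} + H_2O_2 \longrightarrow Fe^{3+} + OH^- + {}^{\bullet}OH}
% Superoxide re-reduces the iron, closing the catalytic cycle.
\mathrm{Fe^{3+} + O_2^{\bullet-} \longrightarrow Fe^{2+} + O_2}
% Net (iron-catalysed Haber--Weiss reaction):
\mathrm{O_2^{\bullet-} + H_2O_2 \longrightarrow O_2 + OH^- + {}^{\bullet}OH}
```

The catalytic role of iron in the net reaction is why its chelation, which removes poorly liganded iron from the cycle, is argued to be of such physiological importance.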
Even Faster Exact k-Means Clustering
A naïve implementation of k-means clustering requires computing, for each of the n data points, the distance to each of the k cluster centers, which can result in fairly slow execution. However, by storing distance information obtained by earlier computations, as well as information about distances between cluster centers, the triangle inequality can be exploited in different ways to reduce the number of needed distance computations, e.g. [3, 4, 5, 7, 11]. In this paper I present an improvement of the Exponion method [11] that generally accelerates the computations. Furthermore, by evaluating several methods on a fairly wide range of artificial data sets, I derive a kind of map showing for which data set parameters which method (often) yields the lowest execution times.
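The basic pruning idea behind this family of accelerations can be sketched as follows. This is a minimal illustration of the simplest triangle-inequality bound (if a point is within half the distance from its center to that center's nearest neighbour, no other center can be closer), not the Exponion method itself; the function name and interface are ours.

```python
import numpy as np

def assign_with_pruning(X, centers, labels):
    """One k-means assignment step using a triangle-inequality bound
    (illustrative sketch of the basic pruning idea, not the Exponion
    method [11]).

    For a point x assigned to center c, if dist(x, c) <= s(c), where
    s(c) is half the distance from c to its nearest other center, the
    triangle inequality guarantees no other center is closer to x, so
    the full scan over all k centers is skipped for that point.
    """
    X, centers = np.asarray(X, float), np.asarray(centers, float)
    # Pairwise center-to-center distances; s[c] = half the smallest one.
    cc = np.sqrt(((centers[:, None] - centers[None]) ** 2).sum(-1))
    np.fill_diagonal(cc, np.inf)
    s = cc.min(axis=1) / 2.0
    skipped = 0
    for i, x in enumerate(X):
        d_own = np.sqrt(((x - centers[labels[i]]) ** 2).sum())
        if d_own <= s[labels[i]]:
            skipped += 1  # pruned: the assignment cannot change
            continue
        labels[i] = np.argmin(((x - centers) ** 2).sum(axis=1))
    return labels, skipped
```

In the common case where centers have nearly converged and most points sit deep inside their clusters, almost every point is pruned, which is the source of the speed-up.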